Crash early and crash often for more reliable software

Matt Klein
Apr 7, 2019

I’ve increasingly noticed a disturbing trend in software engineering: the idea that any total program crash (via segmentation fault, panic, null pointer exception, assertion, etc.) is an indication that a piece of software is poorly written and cannot be trusted.

Although it’s true that, in some cases, crashes may be an indication of unreliable software and subpar development methods, crashing is also a valid error handling method that if used correctly can increase rather than decrease the overall quality, reliability, and velocity of a piece of software.

Code is a liability

Ultimately, every line of code in a program is a liability that may lead to software defects. One particularly pernicious source of code and defect volume is rarely (or never) used error handling code.

def doSomethingWithBar(bar):
    if bar is None:
        # Can this ever actually happen?
        return False
    bar.doSomething()
    return True

In the previous example I show some Python which checks to see if bar is a null value before doing something. To indicate failure (bar is null), the function returns a boolean value (before the committee for functional programming purity comes after me, instead of checking for null, assume bar is an optional and check whether it has a value — the same idea applies).

This example may seem contrived, but I see this pattern frequently in code review, and my review comment is always: can this ever actually happen? Sometimes it can happen, but very often it cannot, and the engineer has written the code in this style because they are under the mistaken belief that all code must have error checking.

The only error checking a program needs is for errors that can actually happen during normal control flow.

If in the previous example bar can never be null, doSomethingWithBar() can be removed from the code entirely and the doSomething() method call inlined. Additionally, the caller of the function no longer needs to deal with the cognitive load of understanding whether and how they need to handle an error from a call to doSomethingWithBar().
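To make the simplification concrete, here is a minimal before-and-after sketch of a call site (handleFailure is a hypothetical stand-in for whatever the caller would have done with the returned boolean):

# Before: the caller interprets a boolean that can never be False in practice.
if not doSomethingWithBar(bar):
    handleFailure()

# After: the wrapper is gone and the call is inlined.
# If bar is ever None, the resulting crash points straight at the real bug.
bar.doSomething()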

The point about caller cognitive load is critical: even though this example is contrived, something like it shows up frequently in real codebases, and when needless error handling propagates up long call chains, the code logic quickly becomes impossible to reason about. This extraneous call-chain logic makes the larger codebase harder to maintain and enhance, and more likely to contain bugs.

What if the author of the code was wrong and bar can in fact be null? Let the program crash! The resulting crash and stack trace will be extremely obvious and easy to debug and fix. It may become clear that the invariant of bar never being null does not actually hold. It may also become clear that some caller of doSomething() is broken and should have supplied a proper bar. Either way, the code stays as simple as possible, without extraneous logic that is hard to reason about.

Assert early and assert often

A corollary to “crash early and crash often” is “assert early and assert often.” Asserts are a tremendously powerful mechanism to verify code invariants (things that should always be true). I favor two types of asserts:

  • Debug asserts: These asserts are compiled out of release builds and check conditions that well and truly should never happen.
  • Release asserts: These asserts are not compiled out of release builds. They are used to check situations that may theoretically happen, but if they do, it would be preferable to crash and restart instead of trying to handle the error.

Debug assertions should outnumber release assertions by a very large factor.
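As a rough Python analogue of this split (Python only to match the earlier example), the builtin assert statement behaves like a debug assert, since it is stripped when the interpreter runs with -O, while a release assert can be approximated with a small helper that always checks and aborts. The release_assert helper and the request-size check below are illustrative, not a standard API:

import os
import sys

MAX_BODY_SIZE = 1024 * 1024  # illustrative limit

def release_assert(condition, message):
    # Always evaluated, even under python -O: abort the process rather than
    # continue running in an unknown state.
    if not condition:
        print(f"release assert failed: {message}", file=sys.stderr)
        os.abort()

def process(request_body):
    # Debug assert: stripped under python -O; documents an invariant that
    # well and truly should never be violated.
    assert request_body is not None

    # Release assert: kept in all builds; theoretically reachable, but if it
    # fires, crash-and-restart is preferable to ad hoc error handling.
    release_assert(len(request_body) <= MAX_BODY_SIZE, "oversized request body")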

I view assertions as an incredibly powerful tool for reducing code complexity for the following reasons:

  1. Assertions act as documentation. The reader clearly understands the state the program is expected to be in when the assertion runs.
  2. Assertions by definition limit extraneous branches and error handling. Why handle a state which should never happen?

For languages that do not contain builtin assertions (e.g., Go), I recommend creating a simple assertion wrapper that crashes the program if the asserted condition is not satisfied.

What happens if the programmer is incorrect and an assertion is not valid in all cases? Let the program crash! Like the null check in the previous section, an assertion failure is trivial to debug. The fix may either be to remove the assertion and add handling code, or to make the calling code adhere to the assertion.

Using code coverage and style guidelines to limit error handling complexity

Although code coverage is an imperfect tool (a high coverage percentage does not necessarily mean a program has quality tests), enforcing a very high level of coverage is useful for encouraging liberal use of assertions and for limiting error handling to only what is necessary.

In the Envoy error handling guidelines we write:

Tip: If the thought of adding the extra test coverage, logging, and stats to handle an error and continue seems ridiculous because “this should never happen”, it’s a very good indication that the appropriate behavior is to terminate the process and not handle the error. When in doubt, please discuss.

In essence, we force all error branches and assertions to be covered by tests. If the programmer feels that writing these tests is a waste of time, that feeling is a useful forcing function for simplifying the code logic.
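For a Python codebase, one way to get a similar forcing function is to have the coverage tool itself fail the build below a chosen threshold, for example with coverage.py (the threshold here is purely illustrative):

# .coveragerc
[report]
# Fail the run when total coverage drops below the threshold, forcing
# untested error branches to be either exercised or deleted.
fail_under = 97
show_missing = True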

Ownership semantics used to prevent crashes can lead to complexity and bugs

One other area in which I often see extra complexity and bugs added in order to ostensibly prevent crashing is in object/data ownership semantics. Fundamentally, there are three different ways data can be allocated and tracked in a program:

  1. Stack
  2. Heap with a single owner (e.g., std::unique_ptr<> in C++, standard borrow checking in Rust, etc.). Note that in many popular languages this ownership type is not available in practice because all heap allocated objects are reference-counted and allow for possible unintentional sharing (e.g., Java, Python, JS, Go, etc.).
  3. Heap with multiple owners (e.g., std::shared_ptr<> in C++, Rc<>/Arc<> in Rust, and garbage-collected references in Java, Go, Python, etc.).

Stack allocation is relatively simple and easy to understand so I’m going to primarily discuss how (2) and (3) relate to crashing early and code complexity.

At a high level, code which uses heap data with a single owner is substantially easier to reason about than code that uses reference-counted data. A single piece of code allocates data and a single piece of code frees it. Very simple. The alternative is shared ownership. The use of shared ownership can make code extremely difficult to reason about. How and when will an object be freed? Will there be any memory leaks due to circular references? (Somewhat ironically, I’ve seen many more production memory leaks in software written in Java and Python due to circular shared references vs. well written C++ that makes heavy use of single owner semantics.)

The downside of the single ownership approach is the ease of creating “use after free” situations in C/C++. Rust avoids “use after free” entirely with the borrow checker, while still allowing for single owner semantics. This is incredibly powerful from both a correctness perspective and a single-data-owner perspective, and I look forward to the day when most code is written in languages with Rust-like semantics. That said, given that the majority of code in the world is still written in C/C++, Java, Python, JS, and similar languages, I will continue this discussion from that viewpoint.

In C/C++, a “use after free” crash can sometimes be difficult to debug (again making Rust borrow semantics very enticing from a productivity standpoint), but it is very clearly a sign that the program crashed and an invariant has been violated. The alternative that I sometimes see tried in C/C++ is favoring Java/Python-like shared object ownership in an effort to avoid these types of crashes. The thinking goes that if an object is never freed while there is a reference to it, the program will never crash. Yet in my experience this inevitably leads to greater code complexity and more bugs due to circular references, hard to reason about logic, etc.

Only use shared memory ownership when the program logic actually calls for it.

For the same reasons that I advocate for limited error handling and extra assertions above, using single owner semantics is preferred. In Rust, the compiler will verify correctness. In C/C++ the compiler will not, but letting the program crash and fixing the uncovered invariant violation is far preferable to introducing needless ownership and code complexity in an attempt to avoid crashes of this type altogether.

For languages that do not allow for explicit single owner semantics, I recommend aggressively setting references to null when no longer in use. This reduces the chance of circular references and should make effective use after free issues more clear.
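A minimal Python sketch of that recommendation (the class and attribute names are hypothetical):

class ConnectionHandler:
    def __init__(self, connection):
        self.connection = connection

    def close(self):
        self.connection.close()
        # Aggressively drop the reference: the connection can be collected
        # promptly, and any later use becomes an obvious crash (an
        # AttributeError on None) rather than silently touching a dead object.
        self.connection = None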

Conclusion

Limiting software complexity is one of the primary mechanisms available to us to limit defects. Very often, invariant violations that cause a fatal crash are substantially easier to debug and fix than complex code that attempted to prevent the crashes in the first place. Specifically, I recommend using the following three techniques for limiting error handling and code complexity:

  1. Limit error handling to only errors that can actually happen during normal control flow. Crash otherwise.
  2. Liberally use assertions to document invariant state and crash if violated.
  3. Use single owner data semantics if at all possible to limit code complexity, and if doing this using C/C++, let the program crash if an ownership invariant is violated.

The three strategies above will limit code complexity and generally yield bugs that are more obvious and easier to fix.
